Group 19: Phase 1 - Cats vs Dogs Detector (CaDoD)

Team Members

We are a group of 4 members:

Aishwarya Sinhasane - avsinhas@iu.edu (In picture, Left top)

Himanshu Joshi - hsjoshi@iu.edu (In picture, Right bottom)

Sreelaxmi Chakkadath - schakkad@iu.edu (In picture, Left bottom)

Sumitha Vellinalur Thattai - svtranga@iu.edu (In picture, Right top)

[Team photo: image.png]

Project Abstract

Object detection, one of the fundamental tasks of computer vision, deals with locating objects in an image and classifying them. In recent years, object detection approaches have evolved rapidly and are being embraced in fields ranging from healthcare to the automotive industry.

The objective of our project is to classify images as either dogs or cats. Additionally, we plan to find where the cat/dog is in the image. Although the task is simple for human eyes, computers find it hard to distinguish between images because of a plethora of factors, including cluttered backgrounds, illumination conditions, deformations, and occlusions, among others. We plan to build an end-to-end machine learning model that will help computers differentiate between cat and dog images with better accuracy.

The data that we plan to use is the CaDoD Kaggle data set, which contains a total of ~13k images of dogs and cats. We have used Stochastic Gradient Descent, AdaBoost, and Gradient Boosting to predict the image class. The Gradient Boosting model gives the highest accuracy and will hence be used as our baseline model. Furthermore, linear regression was used to predict the positions of the cats and dogs.

Accuracy and mean F1 score (the harmonic mean of precision and recall, averaged over classes) are used as evaluation metrics.
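The two metrics above can be computed with sklearn; this is a minimal sketch using toy labels (the labels here are purely illustrative, not project results):

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels purely for illustration (0 = cat, 1 = dog).
y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]

acc = accuracy_score(y_true, y_pred)
# "macro" averaging gives the unweighted mean of the per-class F1 scores.
mean_f1 = f1_score(y_true, y_pred, average="macro")
print(acc, mean_f1)
```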

Furthermore, we plan to improve accuracy by implementing a CNN for classifying images and using Ridge/Lasso regression for bounding-box prediction. In addition to this, we plan to tune the hyperparameters using GridSearchCV/RandomizedSearchCV.
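The planned hyperparameter tuning could look like the following sketch with GridSearchCV; the parameter grid, synthetic features, and labels are illustrative assumptions, not the project's final configuration:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))    # stand-in for flattened image features
y = (X[:, 0] > 0).astype(int)     # stand-in cat/dog labels

# Illustrative grid; the real search would cover the chosen model's knobs.
param_grid = {"alpha": [1e-4, 1e-3, 1e-2], "penalty": ["l2", "l1"]}
search = GridSearchCV(SGDClassifier(max_iter=1000, random_state=0),
                      param_grid, cv=3, scoring="f1_macro")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

RandomizedSearchCV has the same interface but samples a fixed number of configurations, which scales better when the grid is large.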

Project Meta Data

We have completed the following tasks this week:

Data Description

The data we plan to use is the CaDoD Kaggle data set. We will be using two files – one for the images and the other for the bounding boxes:

The images are taken from the cadod.tar.gz

The boundary information of the images is from cadod.csv file

Image information (cadod.tar.gz):

Attributes of the Boundary File (cadod.csv):

- 15 numerical features
  - These include the image ID, the coordinates of the bounding boxes, and the normalized coordinates of the bounding boxes

- 5 categorical features
  - These give information about occlusion, depiction, truncation, etc.
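For illustration, a small DataFrame mimicking the bounding-box file might look like the sketch below. The column names follow the Open Images convention and are assumptions here, not confirmed against the actual cadod.csv:

```python
import pandas as pd

# Assumed schema for cadod.csv (Open Images style); values are made up.
df = pd.DataFrame({
    "ImageID": ["0001", "0002"],
    "LabelName": ["/m/01yrx", "/m/0bt9lr"],   # class codes (assumed)
    "XMin": [0.1, 0.2], "XMax": [0.6, 0.9],   # normalized box coordinates
    "YMin": [0.2, 0.1], "YMax": [0.8, 0.7],
    "IsOccluded": [0, 1],                     # example categorical flag
})
# In the real pipeline this would instead be: df = pd.read_csv("cadod.csv")
print(df.dtypes)
```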

Task

Since the data set is very large, the tasks below were carried out only on a subset of the data, so the results are directional rather than definitive. The following are the end-to-end tasks to achieve the results:

Import Data

Unarchive data

Load bounding box meta data

Exploratory Data Analysis

Statistics

Replace LabelName with human readable labels
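This step can be sketched with a pandas map; the /m/… codes shown are the Open Images identifiers assumed to mark cat and dog in this data set:

```python
import pandas as pd

df = pd.DataFrame({"LabelName": ["/m/01yrx", "/m/0bt9lr", "/m/01yrx"]})

# Assumed mapping from Open Images codes to human-readable labels.
label_map = {"/m/01yrx": "cat", "/m/0bt9lr": "dog"}
df["LabelName"] = df["LabelName"].map(label_map)
print(df["LabelName"].tolist())
```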

Sample of Images

Correlation Between Features

Image shapes and sizes

Go through all images and record the shape of the image in pixels and the memory size

Count all the different image shapes

There are a ton of different image shapes. Let's narrow this down by summing the counts of any image shape that occurs fewer than 100 times and putting those in a category called other.
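A minimal sketch of that binning step, using toy counts (the shapes and numbers here are illustrative, not the data set's actual tallies):

```python
import pandas as pd

# Toy shape counts; the real series would come from value_counts() on shapes.
shape_counts = pd.Series({"(375, 500, 3)": 800, "(500, 375, 3)": 650,
                          "(240, 320, 3)": 40, "(333, 500, 3)": 25})
threshold = 100

common = shape_counts[shape_counts >= threshold].copy()
# Sum every rare shape into a single "other" bucket.
common["other"] = shape_counts[shape_counts < threshold].sum()
print(common)
```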

Drop the individual image shapes that were binned into other

Check if the count sum matches the number of images

Getting the size of each image

Plot

TODO plot aspect ratio

Preprocess

Rescale the images
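The rescaling step might look like this sketch with Pillow; the 128x128 target size and the [0, 1] pixel scaling are assumed choices, not the project's final preprocessing settings:

```python
from PIL import Image
import numpy as np

target_size = (128, 128)   # assumed target; pick to match the model input

# Stand-in image; the real pipeline would load each file from cadod.tar.gz.
img = Image.fromarray(np.zeros((375, 500, 3), dtype=np.uint8))
resized = img.resize(target_size)
arr = np.asarray(resized) / 255.0   # rescale pixel values to [0, 1]
print(arr.shape)
```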

Plot the resized and filtered images

Checkpoint and Save data

!pip install --upgrade pip

!pip install tensorflow

Baseline in SKLearn

Load data

Double check that it loaded correctly

Classification

Split data

Create training and testing sets

Take only a portion of the data set for training and testing purposes. We can use the entire data set later.

Train

I'm choosing SGDClassifier because the data is large; it lets us perform stochastic gradient descent and it supports early stopping. With this many parameters, a model can easily overfit, so it's important to find the point where it begins to overfit and stop there for optimal results.
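A sketch of that setup, with early stopping enabled and a held-out validation fraction; the synthetic data and stopping parameters are illustrative:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 20))            # stand-in image features
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # stand-in labels

# early_stopping holds out validation_fraction of the training data and
# stops when the validation score fails to improve n_iter_no_change times.
clf = SGDClassifier(early_stopping=True, validation_fraction=0.1,
                    n_iter_no_change=5, max_iter=1000, random_state=42)
clf.fit(X, y)
print(clf.n_iter_)   # iterations actually run before stopping
```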

Validation set size of 0.1

Did it stop too early? Let's retrain with a few more iterations to see. Note that SGDClassifier has a parameter called validation_fraction, which splits a validation set off from the training data to determine when training stops.

Evaluation

Validation set size of 0.2

Adaboost

Gradient Boosting
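The two boosting baselines named above can be sketched side by side as follows; the synthetic features and estimator counts are illustrative stand-ins for the real pipeline:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))   # stand-in features
y = (X[:, 0] > 0).astype(int)    # stand-in labels

scores = {}
for model in (AdaBoostClassifier(n_estimators=50, random_state=0),
              GradientBoostingClassifier(n_estimators=50, random_state=0)):
    model.fit(X, y)
    scores[type(model).__name__] = model.score(X, y)   # training accuracy
print(scores)
```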

Regression

Split data

Train

Evaluation
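The bounding-box regression baseline from the abstract might be sketched like this: one linear regression fit jointly over the four box coordinates, evaluated with MSE. The features and targets below are synthetic stand-ins:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 30))   # stand-in image features
# Stand-in targets: 4 box coordinates (XMin, YMin, XMax, YMax).
Y = X @ rng.normal(size=(30, 4)) + rng.normal(scale=0.1, size=(400, 4))

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=1)
reg = LinearRegression().fit(X_tr, Y_tr)   # handles multi-output natively
mse = mean_squared_error(Y_te, reg.predict(X_te))
print(mse)
```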

Homegrown implementation

Implement a homegrown logistic regression model. Extend the loss function from CXE to CXE + MSE, i.e., make it a multitask loss function so that the resulting model predicts the class and the bounding-box coordinates at the same time.
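A hedged numpy sketch of that multitask idea: a logistic head trained with cross-entropy (CXE) plus a linear box head trained with MSE, optimized jointly by plain gradient descent. The data shapes, learning rate, and iteration count are illustrative choices, not the project's settings:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = (X[:, 0] > 0).astype(float)          # stand-in class labels
B = X @ rng.normal(size=(d, 4)) * 0.1    # stand-in box targets (4 coords)

w = np.zeros(d)        # classification weights
V = np.zeros((d, 4))   # box-regression weights
lr = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(300):
    p = sigmoid(X @ w)   # predicted class probabilities
    box = X @ V          # predicted box coordinates
    # Joint loss L = CXE(y, p) + MSE(B, box); gradient of each term:
    grad_w = X.T @ (p - y) / n
    grad_V = 2 * X.T @ (box - B) / n
    w -= lr * grad_w
    V -= lr * grad_V

acc = np.mean((sigmoid(X @ w) > 0.5) == (y > 0.5))
mse = np.mean((X @ V - B) ** 2)
print(acc, mse)
```

Because the two loss terms share only the input features here, the gradients decouple; with a shared feature extractor (as in a CNN) the terms would interact through the shared weights.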

Results / Discussion

Conclusion